Learning from Corrupted Binary Labels via Class-Probability Estimation
Abstract
Many supervised learning problems involve learning from samples whose labels are corrupted in some way. For example, each label may be flipped with some constant probability (learning with label noise), or one may have a pool of unlabelled samples in lieu of negative samples (learning from positive and unlabelled data). This paper uses class-probability estimation to study these and other corruption processes belonging to the mutually contaminated distributions framework (Scott et al., 2013), with three conclusions. First, one can optimise balanced error and AUC without knowledge of the corruption parameters. Second, given estimates of the corruption parameters, one can minimise a range of classification risks. Third, one can estimate corruption parameters via a class-probability estimator (e.g. kernel logistic regression) trained solely on corrupted data. Experiments on label noise tasks corroborate our analysis.

1. Learning from corrupted binary labels

In many practical scenarios involving learning from binary labels, one observes samples whose labels are corrupted versions of the actual ground truth. For example, in learning from class-conditional label noise (CCN learning), the labels are flipped with some constant probability (Angluin & Laird, 1988). In positive and unlabelled learning (PU learning), we have access to some positive samples, but in lieu of negative samples only have a pool of samples whose label is unknown (Denis, 1998).

More generally, suppose there is a notional clean distribution $D$ over instances and labels. We say a problem involves learning from corrupted binary labels if we observe training samples drawn from some corrupted distribution $D_{\mathrm{corr}}$ such that the observed labels do not represent those we would observe under $D$.

A fundamental question is whether one can minimise a given performance measure with respect to $D$, given access only to samples from $D_{\mathrm{corr}}$. Intuitively, in general this requires knowledge of the parameters of the corruption process that determines $D_{\mathrm{corr}}$. This yields two further questions: are there measures for which knowledge of these corruption parameters is unnecessary, and, for other measures, can we estimate these parameters?

In this paper, we consider corruption problems belonging to the mutually contaminated distributions framework (Scott et al., 2013). We then study the above questions through the lens of class-probability estimation, with three conclusions. First, optimising balanced error (BER) as-is on corrupted data equivalently optimises BER on clean data, and similarly for the area under the ROC curve (AUC). That is, these measures can be optimised without knowledge of the corruption process parameters; further, we present evidence that these are essentially the only measures with this property. Second, given estimates of the corruption parameters, a range of classification measures can be minimised by thresholding corrupted class-probabilities. Third, under some assumptions, these corruption parameters may be estimated from the range of the corrupted class-probabilities.

For all points above, observe that learning requires only corrupted data. Further, corrupted class-probability estimation can be seen as treating the observed samples as if they were uncorrupted. Thus, our analysis gives justification (under some assumptions) for this apparent heuristic in problems such as CCN and PU learning.
While some of our results are known for the special cases of CCN and PU learning, our interest is in determining to what extent they generalise to other label corruption problems. This is a step towards a unified treatment of these problems. We now fix notation and formalise the problem.

2. Background and problem setup

Fix an instance space $\mathcal{X}$. We denote by $D$ some distribution over $\mathcal{X} \times \{\pm 1\}$, with $(X, Y) \sim D$ a pair of random variables. Any $D$ may be expressed via the class-conditional distributions $(P, Q) = (\mathbb{P}(X \mid Y = 1), \mathbb{P}(X \mid Y = -1))$ and base rate $\pi = \mathbb{P}(Y = 1)$, or equivalently via the marginal distribution $M = \mathbb{P}(X)$ and class-probability function $\eta \colon x \mapsto \mathbb{P}(Y = 1 \mid X = x)$. When referring to these constituent distributions, we write $D$ as $D_{P,Q,\pi}$ or $D_{M,\eta}$.

2.1. Classifiers, scorers, and risks

A classifier is any function $f \colon \mathcal{X} \to \{\pm 1\}$. A scorer is any function $s \colon \mathcal{X} \to \mathbb{R}$. Many learning methods (e.g. SVMs) output a scorer, from which a classifier is formed by thresholding about some $t \in \mathbb{R}$. We denote the resulting classifier by $\mathrm{thresh}(s, t) \colon x \mapsto \mathrm{sign}(s(x) - t)$.

The false positive and false negative rates of a classifier $f$ are denoted $\mathrm{FPR}(f), \mathrm{FNR}(f)$, and are defined by $\mathbb{P}_{X \sim Q}(f(X) = 1)$ and $\mathbb{P}_{X \sim P}(f(X) = -1)$ respectively.

Given a function $\Psi \colon [0,1]^3 \to [0,1]$, a classification performance measure $\mathrm{Class}^D_\Psi \colon \{\pm 1\}^{\mathcal{X}} \to [0,1]$ assesses the performance of a classifier $f$ via (Narasimhan et al., 2014)

$$\mathrm{Class}^D_\Psi(f) = \Psi(\mathrm{FPR}(f), \mathrm{FNR}(f), \pi).$$

A canonical example is the misclassification error, where $\Psi \colon (u, v, p) \mapsto p \cdot v + (1 - p) \cdot u$. Given a scorer $s$, we use $\mathrm{Class}^D_\Psi(s; t)$ to refer to $\mathrm{Class}^D_\Psi(\mathrm{thresh}(s, t))$. The $\Psi$-classification regret of a classifier $f \colon \mathcal{X} \to \{\pm 1\}$ is

$$\mathrm{regret}^D_\Psi(f) = \mathrm{Class}^D_\Psi(f) - \inf_{g \colon \mathcal{X} \to \{\pm 1\}} \mathrm{Class}^D_\Psi(g).$$

A loss is any function $\ell \colon \{\pm 1\} \times \mathbb{R} \to \mathbb{R}_+$. Given a distribution $D$, the $\ell$-risk of a scorer $s$ is defined as

$$\mathbb{L}^D_\ell(s) = \mathbb{E}_{(X, Y) \sim D}[\ell(Y, s(X))]. \quad (1)$$

The $\ell$-regret of a scorer, $\mathrm{regret}^D_\ell$, is as per the $\Psi$-regret. We say $\ell$ is strictly proper composite (Reid & Williamson, 2010) if $\mathrm{argmin}_s\, \mathbb{L}^D_\ell(s)$ is some strictly monotone transformation $\psi$ of $\eta$, i.e. we can recover class-probabilities from the optimal prediction via the link function $\psi$. We call class-probability estimation (CPE) the task of minimising Equation (1) for some strictly proper composite $\ell$.

The conditional Bayes-risk of a strictly proper composite $\ell$ is $\underline{L}_\ell \colon \eta \mapsto \eta \, \ell_1(\psi(\eta)) + (1 - \eta) \, \ell_{-1}(\psi(\eta))$. We call $\ell$ strongly proper composite with modulus $\lambda$ if $\underline{L}_\ell$ is $\lambda$-strongly concave (Agarwal, 2014). Canonical examples of such losses are the logistic and exponential loss, as used in logistic regression and AdaBoost respectively.
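To make the above concrete, here is a minimal sketch of CPE with the logistic loss, which is strictly (indeed strongly) proper composite with logit link $\psi$, so that $\psi^{-1}$ (the sigmoid) recovers class-probability estimates from a learned scorer, and a classifier is obtained via $\mathrm{thresh}(s, t)$. The synthetic Gaussian data and the use of scikit-learn's LogisticRegression are our own illustrative assumptions, not part of the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Synthetic data from two Gaussian class-conditionals P and Q (illustrative only).
n, pi = 2000, 0.3                               # base rate pi = P(Y = 1)
y = np.where(rng.rand(n) < pi, 1, -1)
X = rng.randn(n, 2) + np.where(y[:, None] == 1, 1.0, -1.0)

# CPE with the logistic loss: the optimal scorer is (approximately) psi(eta),
# where psi is the logit link.
cpe = LogisticRegression().fit(X, y)
scores = cpe.decision_function(X)               # scorer s(x)
eta_hat = 1.0 / (1.0 + np.exp(-scores))         # psi^{-1}(s) = sigmoid recovers class-probabilities

def thresh(s, t):
    """thresh(s, t): x -> sign(s(x) - t), with ties broken towards -1."""
    return np.where(s > t, 1, -1)

# Misclassification error corresponds to thresholding eta at 1/2,
# i.e. thresholding the scorer at psi(1/2) = 0.
y_pred = thresh(scores, 0.0)
print("training error:", np.mean(y_pred != y))
```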
2.2. Learning from contaminated distributions

Suppose $D_{P,Q,\pi}$ is some "clean" distribution where performance will be assessed. (We do not assume that $D$ is separable.) In MC learning (Scott et al., 2013), we observe samples from some corrupted distribution $\mathrm{Corr}(D, \alpha, \beta, \pi_{\mathrm{corr}})$ over $\mathcal{X} \times \{\pm 1\}$, for some unknown noise parameters $\alpha, \beta \in [0, 1]$ with $\alpha + \beta < 1$; where the parameters are clear from context, we occasionally refer to the corrupted distribution as $D_{\mathrm{corr}}$. The corrupted class-conditional distributions $P_{\mathrm{corr}}, Q_{\mathrm{corr}}$ are

$$P_{\mathrm{corr}} = (1 - \alpha) \cdot P + \alpha \cdot Q \qquad Q_{\mathrm{corr}} = \beta \cdot P + (1 - \beta) \cdot Q, \quad (2)$$

and the corrupted base rate $\pi_{\mathrm{corr}}$ in general has no relation to the clean base rate $\pi$. (If $\alpha + \beta = 1$, then $P_{\mathrm{corr}} = Q_{\mathrm{corr}}$, making learning impossible, whereas if $\alpha + \beta > 1$, we can swap $P_{\mathrm{corr}}, Q_{\mathrm{corr}}$.) Table 1 summarises common quantities on the clean and corrupted distributions.

Table 1. Common quantities on clean and corrupted distributions.

Quantity | Clean | Corrupted
Joint distribution | $D$ | $\mathrm{Corr}(D, \alpha, \beta, \pi_{\mathrm{corr}})$ or $D_{\mathrm{corr}}$
Class-conditionals | $P, Q$ | $P_{\mathrm{corr}}, Q_{\mathrm{corr}}$
Base rate | $\pi$ | $\pi_{\mathrm{corr}}$
Class-probability | $\eta$ | $\eta_{\mathrm{corr}}$
$\Psi$-optimal threshold | $t^D_\Psi$ | $t^{D_{\mathrm{corr}}}_\Psi$

From (2), we see that none of $P_{\mathrm{corr}}$, $Q_{\mathrm{corr}}$ or $\pi_{\mathrm{corr}}$ contain any information about $\pi$ in general. Thus, estimating $\pi$ from $\mathrm{Corr}(D, \alpha, \beta, \pi_{\mathrm{corr}})$ is impossible in general. The parameters $\alpha, \beta$ are also non-identifiable, but can be estimated under some assumptions on $D$ (Scott et al., 2013).

2.3. Special cases of MC learning

Two special cases of MC learning are notable. In learning from class-conditional label noise (CCN learning) (Angluin & Laird, 1988), positive samples have labels flipped with probability $\rho_+$, and negative samples with probability $\rho_-$. This can be shown to reduce to MC learning with

$$\alpha = \pi_{\mathrm{corr}}^{-1} \cdot (1 - \pi) \cdot \rho_- \qquad \beta = (1 - \pi_{\mathrm{corr}})^{-1} \cdot \pi \cdot \rho_+, \quad (3)$$

and the corrupted base rate $\pi_{\mathrm{corr}} = (1 - \rho_+) \cdot \pi + \rho_- \cdot (1 - \pi)$. (See Appendix C for details.)

In learning from positive and unlabelled data (PU learning) (Denis, 1998), one has access to unlabelled samples in lieu of negative samples. There are two subtly different settings: in the case-controlled setting (Ward et al., 2009), the unlabelled samples are drawn from the marginal distribution $M$, corresponding to MC learning with $\alpha = 0$, $\beta = \pi$, and $\pi_{\mathrm{corr}}$ arbitrary. In the censoring setting (Elkan & Noto, 2008), observations are drawn from $D$ followed by a label censoring procedure. This is in fact a special case of CCN (and hence MC) learning with $\rho_- = 0$.
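The corruption model and its CCN special case can be illustrated with a short simulation. The sketch below (our own illustration, assuming one-dimensional Gaussian class-conditionals; the function names are hypothetical) samples from $P_{\mathrm{corr}}$ and $Q_{\mathrm{corr}}$ as the mixtures in Equation (2), and maps CCN noise rates $(\rho_+, \rho_-)$ and clean base rate $\pi$ to MC parameters $(\alpha, \beta, \pi_{\mathrm{corr}})$ as in Equation (3).

```python
import numpy as np

rng = np.random.RandomState(0)

# Illustrative clean class-conditionals: P = N(+1, 1), Q = N(-1, 1) on the real line.
sample_P = lambda m: rng.randn(m) + 1.0
sample_Q = lambda m: rng.randn(m) - 1.0

def sample_mc_corrupted(m, alpha, beta, pi_corr):
    """Draw m (x, y_corr) pairs from Corr(D, alpha, beta, pi_corr), per Equation (2):
    P_corr = (1 - alpha) P + alpha Q and Q_corr = beta P + (1 - beta) Q."""
    y_corr = np.where(rng.rand(m) < pi_corr, 1, -1)
    from_P = np.where(y_corr == 1,
                      rng.rand(m) < 1 - alpha,   # mixture component for corrupted positives
                      rng.rand(m) < beta)        # mixture component for corrupted negatives
    x = np.where(from_P, sample_P(m), sample_Q(m))
    return x, y_corr

def ccn_to_mc(pi, rho_plus, rho_minus):
    """Map CCN noise rates (rho_+, rho_-) and clean base rate pi to MC parameters,
    per Equation (3)."""
    pi_corr = (1 - rho_plus) * pi + rho_minus * (1 - pi)
    alpha = (1 - pi) * rho_minus / pi_corr
    beta = pi * rho_plus / (1 - pi_corr)
    return alpha, beta, pi_corr

alpha, beta, pi_corr = ccn_to_mc(pi=0.3, rho_plus=0.2, rho_minus=0.1)
print(alpha, beta, pi_corr)   # ~0.226, ~0.087, 0.31
x, y_corr = sample_mc_corrupted(10000, alpha, beta, pi_corr)
```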
3. BER and AUC are immune to corruption

We first show that optimising balanced error and AUC on corrupted data is equivalent to doing so on clean data. Thus, with a suitably rich function class, one can optimise balanced error and AUC from corrupted data without knowledge of the corruption process parameters.

3.1. BER minimisation is immune to label corruption

The balanced error (BER) (Brodersen et al., 2010) of a classifier is simply the mean of the per-class error rates,

$$\mathrm{BER}(f) = \frac{\mathrm{FPR}(f) + \mathrm{FNR}(f)}{2}.$$

This is a popular measure in imbalanced learning problems (Cheng et al., 2002; Guyon et al., 2004), as it penalises sacrificing accuracy on the rare class in favour of accuracy on the dominant class. The negation of the BER is also known as the AM (arithmetic mean) metric (Menon et al., 2013).

The BER-optimal classifier thresholds the class-probability function at the base rate (Menon et al., 2013), so that

$$\mathrm{argmin}_{f \colon \mathcal{X} \to \{\pm 1\}} \mathrm{BER}(f) = \mathrm{thresh}(\eta, \pi) \quad (4)$$
$$\mathrm{argmin}_{f \colon \mathcal{X} \to \{\pm 1\}} \mathrm{BER}_{\mathrm{corr}}(f) = \mathrm{thresh}(\eta_{\mathrm{corr}}, \pi_{\mathrm{corr}}), \quad (5)$$

where $\eta_{\mathrm{corr}}$ denotes the corrupted class-probability function. As Equation (4) depends on $\pi$, it may appear that one must know $\pi$ to minimise the clean BER from corrupted data. Surprisingly, the BER-optimal classifiers in Equations (4) and (5) coincide. This is because of the following relationship between the clean and corrupted BER.

Proposition 1. Pick any $D$ and $\mathrm{Corr}(D, \alpha, \beta, \pi_{\mathrm{corr}})$. Then, for any classifier $f \colon \mathcal{X} \to \{\pm 1\}$,

$$\mathrm{BER}_{\mathrm{corr}}(f) = (1 - \alpha - \beta) \cdot \mathrm{BER}(f) + \frac{\alpha + \beta}{2}, \quad (6)$$

and so the minimisers of the two are identical.

Thus, when BER is the desired performance metric, we do not need to estimate the noise parameters or the clean base rate: we can (approximately) optimise the BER on the corrupted data using estimates $\hat{\eta}_{\mathrm{corr}}, \hat{\pi}_{\mathrm{corr}}$, from which we build a classifier $\mathrm{thresh}(\hat{\eta}_{\mathrm{corr}}, \hat{\pi}_{\mathrm{corr}})$. Observe that this approach effectively treats the corrupted samples as if they were clean, e.g. in a PU learning problem, we treat the unlabelled samples as negative, and perform CPE as usual.

With a suitably rich function class, surrogate regret bounds quantify the efficacy of thresholding approximate class-probability estimates. Suppose we know the corrupted base rate $\pi_{\mathrm{corr}}$, and suppose that $s$ is a scorer with low $\ell$-regret on the corrupted distribution for some proper composite loss $\ell$ with link $\psi$, i.e. $\psi^{-1}(s)$ is a good estimate of $\eta_{\mathrm{corr}}$. Then, the classifier resulting from thresholding this scorer will attain low BER on the clean distribution $D$. (Surrogate regret bounds may also be derived for an empirically chosen threshold; see Kotłowski & Dembczyński, 2015.)

Proposition 2. Pick any $D$ and $\mathrm{Corr}(D, \alpha, \beta, \pi_{\mathrm{corr}})$. Let $\ell$ be a strongly proper composite loss with modulus $\lambda$ and link function $\psi$. Then, for any scorer $s \colon \mathcal{X} \to \mathbb{R}$,

$$\mathrm{regret}_{\mathrm{BER}}(f) \leq \frac{C(\pi_{\mathrm{corr}})}{1 - \alpha - \beta} \cdot \sqrt{\frac{2}{\lambda}} \cdot \sqrt{\mathrm{regret}^{\mathrm{corr}}_\ell(s)},$$

where $f = \mathrm{thresh}(s, \psi(\pi_{\mathrm{corr}}))$ and $C(\pi_{\mathrm{corr}}) = (2 \cdot \pi_{\mathrm{corr}} \cdot (1 - \pi_{\mathrm{corr}}))^{-1}$.

Thus, good estimates of the corrupted class-probabilities let us minimise the clean BER. Of course, learning from corrupted data comes at a price: compared to the regret bound obtained if we could minimise $\ell$ on the clean distribution $D$, we have an extra penalty of $(1 - \alpha - \beta)^{-1}$. This matches our intuition that for high-noise regimes (i.e. $\alpha + \beta \approx 1$), we need more corrupted samples to learn effectively with respect to the clean distribution; confer van Rooyen & Williamson (2015) for lower and upper bounds on sample complexity for a range of corruption problems.

3.2. AUC maximisation is immune to label corruption

Another popular performance measure in imbalanced learning scenarios is the area under the ROC curve (AUC). The AUC of a scorer, $\mathrm{AUC}(s)$, is the probability of a random positive instance scoring higher than a random negative instance (Agarwal et al., 2005):

$$\mathrm{AUC}(s) = \mathbb{E}_{X \sim P,\, X' \sim Q}\left[ [\![ s(X) > s(X') ]\!] + \frac{1}{2} [\![ s(X) = s(X') ]\!] \right].$$

We have a counterpart to Proposition 1 by rewriting the AUC as an average of BER across a range of thresholds ((Flach et al., 2011); see Appendix A.5):

$$\mathrm{AUC}(s) = \frac{3}{2} - 2 \cdot \mathbb{E}_{X \sim P}[\mathrm{BER}(s; s(X))]. \quad (7)$$

Corollary 3. Pick any $D_{P,Q,\pi}$ and $\mathrm{Corr}(D, \alpha, \beta, \pi_{\mathrm{corr}})$. Then, for any scorer $s \colon \mathcal{X} \to \mathbb{R}$,

$$\mathrm{AUC}_{\mathrm{corr}}(s) = (1 - \alpha - \beta) \cdot \mathrm{AUC}(s) + \frac{\alpha + \beta}{2}. \quad (8)$$

Thus, like the BER, optimising the AUC with respect to the corrupted distribution optimises the AUC with respect to the clean one. Further, via recent bounds on the AUC regret (Agarwal, 2014), we can show that a good corrupted class-probability estimator will have good clean AUC.

Corollary 4. Pick any $D$ and $\mathrm{Corr}(D, \alpha, \beta, \pi_{\mathrm{corr}})$. Let $\ell$ be a strongly proper composite loss with modulus $\lambda$. Then, for every scorer $s \colon \mathcal{X} \to \mathbb{R}$,

$$\mathrm{regret}_{\mathrm{AUC}}(s) \leq \frac{C(\pi_{\mathrm{corr}})}{1 - \alpha - \beta} \cdot \sqrt{\frac{2}{\lambda}} \cdot \sqrt{\mathrm{regret}^{\mathrm{corr}}_\ell(s)},$$

where $C(\pi_{\mathrm{corr}}) = (\pi_{\mathrm{corr}} \cdot (1 - \pi_{\mathrm{corr}}))^{-1}$.

What is special about the BER (and consequently the AUC) that lets us avoid estimation of the corruption parameters? To answer this, we more carefully study the structure of $\eta_{\mathrm{corr}}$, to understand why Equations (4) and (5) coincide, and whether any other measures have this property.

Relation to existing work. For the special case of CCN learning, Proposition 1 was shown in Blum & Mitchell (1998, Section 5), and for case-controlled PU learning in (Lee & Liu, 2003; Zhang & Lee, 2008). None of these works established surrogate regret bounds.
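As a numerical sanity check on Proposition 1 and Corollary 3, the following sketch (our own illustration with synthetic Gaussian data and scikit-learn; this is not the paper's experimental setup) trains a class-probability estimator on CCN-corrupted labels, thresholds it at the estimated corrupted base rate, and compares clean and corrupted BER and AUC. Up to sampling error, the corrupted quantities should equal $(1 - \alpha - \beta)$ times the clean ones plus $(\alpha + \beta)/2$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(1)

# Clean data: two Gaussian class-conditionals, base rate pi.
n, pi = 20000, 0.25
y = np.where(rng.rand(n) < pi, 1, -1)
X = rng.randn(n, 2) + np.where(y[:, None] == 1, 1.0, -1.0)

# CCN corruption: flip positives w.p. rho_+, negatives w.p. rho_-.
rho_plus, rho_minus = 0.3, 0.1
flip = np.where(y == 1, rng.rand(n) < rho_plus, rng.rand(n) < rho_minus)
y_corr = np.where(flip, -y, y)

# CPE on corrupted data, treating the corrupted labels as if they were clean.
cpe = LogisticRegression().fit(X, y_corr)
eta_corr_hat = cpe.predict_proba(X)[:, 1]
pi_corr_hat = np.mean(y_corr == 1)                    # estimate of corrupted base rate
y_pred = np.where(eta_corr_hat > pi_corr_hat, 1, -1)  # thresh(eta_corr_hat, pi_corr_hat)

def ber(y_true, y_hat):
    fpr = np.mean(y_hat[y_true == -1] == 1)
    fnr = np.mean(y_hat[y_true == 1] == -1)
    return 0.5 * (fpr + fnr)

# MC parameters implied by the CCN noise, per Equation (3).
pi_corr = (1 - rho_plus) * pi + rho_minus * (1 - pi)
alpha = (1 - pi) * rho_minus / pi_corr
beta = pi * rho_plus / (1 - pi_corr)

print("clean BER :", ber(y, y_pred))
print("corr. BER :", ber(y_corr, y_pred),
      "~", (1 - alpha - beta) * ber(y, y_pred) + (alpha + beta) / 2)
print("clean AUC :", roc_auc_score(y, eta_corr_hat))
print("corr. AUC :", roc_auc_score(y_corr, eta_corr_hat),
      "~", (1 - alpha - beta) * roc_auc_score(y, eta_corr_hat) + (alpha + beta) / 2)
```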
4. Corrupted and clean class-probabilities

The equivalence between a specific thresholding of the clean and corrupted class-probabilities (Equations (4) and (5)) hints at a relationship between the two functions. We now make this relationship explicit.

Proposition 5. For any $D_{M,\eta}$ and $\mathrm{Corr}(D, \alpha, \beta, \pi_{\mathrm{corr}})$,

$$(\forall x \in \mathcal{X}) \quad \eta_{\mathrm{corr}}(x) = T(\alpha, \beta, \pi, \pi_{\mathrm{corr}}, \eta(x)), \quad (9)$$

where, for $\phi \colon z \mapsto \frac{z}{1+z}$, $T(\alpha, \beta, \pi, \pi_{\mathrm{corr}}, t)$ is given by

$$\phi\left( \frac{\pi_{\mathrm{corr}}}{1 - \pi_{\mathrm{corr}}} \cdot \frac{(1 - \alpha) \cdot \frac{1 - \pi}{\pi} \cdot \frac{t}{1 - t} + \alpha}{\beta \cdot \frac{1 - \pi}{\pi} \cdot \frac{t}{1 - t} + (1 - \beta)} \right).$$
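A direct way to see why Equations (4) and (5) coincide is that the transform in Proposition 5 is strictly monotone in $t$ and maps the clean base rate to the corrupted one, i.e. $T(\alpha, \beta, \pi, \pi_{\mathrm{corr}}, \pi) = \pi_{\mathrm{corr}}$. The sketch below (our own illustrative code, with arbitrarily chosen parameters) implements $T$ and checks these two properties numerically.

```python
import numpy as np

def T(alpha, beta, pi, pi_corr, t):
    """Corrupted class-probability as a function of the clean one, per Equation (9)."""
    phi = lambda z: z / (1.0 + z)                   # phi: z -> z / (1 + z)
    r = (1.0 - pi) / pi * t / (1.0 - t)             # likelihood ratio P(x)/Q(x) implied by eta(x) = t
    odds = (pi_corr / (1.0 - pi_corr)
            * ((1.0 - alpha) * r + alpha) / (beta * r + (1.0 - beta)))
    return phi(odds)

alpha, beta, pi, pi_corr = 0.2, 0.1, 0.3, 0.45      # any parameters with alpha + beta < 1
t = np.linspace(0.01, 0.99, 99)
eta_corr = T(alpha, beta, pi, pi_corr, t)

# T is strictly increasing in t, so thresholding eta_corr at T(..., c) gives the same
# classifier as thresholding eta at c; in particular, eta(x) > pi iff eta_corr(x) > pi_corr.
assert np.all(np.diff(eta_corr) > 0)
print(T(alpha, beta, pi, pi_corr, pi))              # equals pi_corr = 0.45
```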